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The concepts of attention and intrinsic motivations are of great interest witiiin adaptive 
robotic systems, and can be exploited in order to guide, activate, and coordinate multiple 
concurrent behaviors. Attention allocation strategies represent key capabilities of human 
beings, which are strictly connected with action selection and execution mechanisms, 
while intrinsic motivations directly affect the allocation of attentional resources. In 
this paper we propose a model of Reinforcement Learning (RL), where both these 
capabilities are involved. RL is deployed to learn how to allocate attentional resources in 
a behavior-based robotic system, while action selection is obtained as a side effect of the 
resulting motivated attentional behaviors. Moreover, the influence of intrinsic motivations 
in attention orientation is obtained by introducing rewards associated with curiosity drives. 
In this way, the learning process is affected not only by goal-specific rewards, but also by 
intrinsic motivations. 

Keywords: attention shifting, curiosity, intrinsic motivations, reinforcement learning, action selection 



INTRODUCTION 

Attention and intrinsic motivations play a crucial role in cognitive 
control (Posner et al., 1980) and are of great interest in cogni- 
tive robotics. Indeed, attentional mechanisms and motivational 
drives are strictly involved in the process of guiding and orches- 
trating multiple concurrent behaviors. Attentional mechanisms, 
beyond their role in perception orientation, are also considered 
as key mechanisms in action selection and coordination (Posner 
et al., 1980; Norman and Shallice, 1986). The capability of select- 
ing and filtering the information is associated with the process 
of focusing cognitive and executive resources toward the stimuli 
that are relevant for the environmental and behavioral context. 
On the other hand, another key factor affecting action selec- 
tion is represented by the so called intrinsic motivations such 
as the curiosity (Baldassarre, 2011), which can indirectly affect 
action selection because of its influence on attentional shifting. 
For instance, the curiosity drive can attract the attentional focus 
toward novel stimuli and, consequently, can elicit the execution of 
actions which are not directly related to the current behavior or 
goal. Albeit there is not a clear consensus on how intrinsic moti- 
vations differ from the extrinsic ones (Baldassarre, 2011), their 
role in pushing human/animal beings to spontaneously explore 
their environment (Baldassarre and MiroUi, 2013) and to execute 
this activity only for their inherent satisfaction (Ryan and Deci, 
2000), rather than for satisfying some basic needs such as hunger 
or thirst (White, 1959; Berlyne, 1960), is widely accepted. 



In this work, we focus on the intrinsic motivation provided by 
the curiosity, which is considered as the main drive for humans 
to explore novel situations and to learn complex behaviors from 
experience (Berlyne, 1954; Litman, 2005). Recent studies have 
also shown that both attention and curiosity are strictly related 
to the dopaminergic system responsible for action driving. It is 
widely accepted, indeed, that dopamine affects both the reward 
excitement, fundamental in the learning process, and the demand 
of more attention by novel stimuli (NieouUon, 2002; Redgrave 
and Gurney, 2006; lepma et al., 2012). Unpredicted events can 
generate intrinsic reinforcement signals, which support the acqui- 
sition of novel actions. In particular, it has been shown that the 
dopamine release is triggered not only in response to unexpected 
environmental changes and goal-directed action-outcome learn- 
ing (Heidbreder and Groenewegen, 2003; Dalley et al., 2004), but 
also in response to the detection of novel events (Lisman and 
Grace, 2005). 

The typical approach adopted for modeling the dopamine- 
like rewarding system (Montague et al., 1996) and for coping 
with the problem of treating intrinsic motivations (Barto et al., 
2004; MiroUi and Baldassarre, 2013) is represented by the well 
known Reinforcement Learning (RL) process. Recent works have 
been proposed to incorporate models for novelty (Marsland et al., 
2000) and curiosity (Schmidhuber, 1991) within Motivated RL 
algorithms (Barto et al., 2004) providing accounts for behav- 
ior adaptation, action selection learning, mental development, 
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and learning of hierarchical collections of skills depending on 
the robot experience (Kaplan and Oudeyer, 2003; Barto et al., 
2004; Oudeyer and Kaplan, 2007; Schembri et al, 2007; Singh 
et al., 2010; Baranes and Oudeyer, 2013). Typically, within these 
approaches, RL is used to directly model and generate the action 
selection strategies. In contrast, we propose a system where RL 
is deployed to learn attentional allocation and shifting strategies, 
while action selection emerges from the regulation of attentional 
monitoring mechanisms (Di Nocera et al., 2012), which can be 
affected by the intrinsic motivation of curiosity. Our curiosity 
model is inspired by the interest/deprivation model proposed 
by Litman (2005), which captures both optimal-arousal and 
curiosity-driven approaches of curiosity modeling. Following this 
approach, attentional shifting mechanisms can be generated tak- 
ing into account not only extrinsic motivations, like mission goals 
and primary needs satiation, but also intrinsic motivations, like 
the need of acquiring knowledge (Litman's deprivation model) 
and the attraction toward novel stimuli and opportunity of learn- 
ing (Litman's interest-based model). In this context, we aim to 
investigate whether our account of curiosity and attentional reg- 
ulation learning is feasible and effective for the generation of 
attentional allocation and shifting strategies, whose side effect is 
an adaptive emergent behavior for the robot. We are also inter- 
ested in the impact of our model of curiosity on the learning 
process. Specifically, we want to assess whether the proposed 
intrinsically motivated system affects the progress in learning. 

We detail the approach by describing our intrinsically moti- 
vated RL model and analyzing its performance in a simulated 
survival domain. In this scenario, the robot is engaged in survival 
tasks such as finding food or water, while avoiding dangerous sit- 
uations. The goal is to learn attentional allocation and shifting 
policies that allow the robot to survive in the particular environ- 
ment. The system evaluation is based on a comparison between 
the performance of the attentional policies, which are learned 
with the curiosity model, with respect to the ones generated with- 
out taking into account the curiosity drive. The collected results 
show that our intrinsically motivated learning approach is fea- 
sible and effective. Indeed, the curiosity-driven learning system 
allows us to find satisfactory attentional allocation and shifting 
policies showing a faster convergence of the learning process, 
safer policies of the selected action, and a higher wellness state 
of the robotic system in terms of energy gained during the explo- 
ration of the environment. In particular, in the curious setting 
the robot behavior seems more flexible because endowed with 
an additional capacity of adaptation. Indeed, different attentional 
allocation (or shifting) policies, and consequently, various action 
selection policies, can be defined depending on the current level 
of curiosity. 

MATERIAL AND METHODS 
AHENTIONAL SHIFTING SYSTEM 

In this work, we refer to the attentional framework introduced 
by Burattini et al. (2010). Here, the attentional system is mod- 
eled as a reactive behavior-based system (Brooks, 1986; Arkin, 
1998), endowed with internal attentional mechanisms capable 
of distributing and shifting the attention among different con- 
current behaviors depending on the current saliency of tasks 



and stimuli. These attentional mechanisms allow the robotic 
system to supervise multiple concurrent behaviors and to effi- 
ciently manage limited resources. In contrast with typical works 
on visual attention (Itti and Koch, 2001), the Burattini et al. 
(2010) approach is not concerned with the orientation of the 
attention in the space (i.e., the field of view), but it is about 
the executive attention (Posner et al., 1980) and the temporal 
distribution of the attentional resources needed to monitor and 
control multiple processes. This model of attention is inspired by 
Pashler and lohnston (1998), where the attentional load due to 
the accomplishment of a particular task is defined as the quantity 
of attentional time units devoted to that particular task, and by 
Senders (1964), where attentional allocation and shifting mecha- 
nisms are related to the sampling rate needed to monitor multiple 
parallel processes. In particular, Burattini et al. (2010) propose a 
frequency-based model of attention allocation, where the incre- 
ment of the attention due to salient stimuli is associated with 
an increment of sensors sampling rate and of the behavior acti- 
vations. Specifically, starting from a behavior-based architecture, 
each behavior is endowed with an internal clock regulating its 
activation frequency and sensory sampling rate: the higher the 
sampling rate, the higher the resolution at which the behavior 
is monitored and controlled. The internal clock can increase or 
decrease the attentional state of each behavior with respect to 
salient internal/ external stimuli by means of suitable attentional 
monitoring functions. In this context, the internal stimuli are 
modeled as internal needs, such as, for example, thirst or hunger, 
while the external stimuli are associated with salient events or 
discontinuities perceived in the external environment. 

An explicative example of this behavior-based attentional sys- 
tem at work is presented in Figure 1 . The plot shows how the 
sampling rate of a behavior (for example a give task) changes 
(see Figure lA) depending on different stimuli (for example, the 
human hand speed, in Figure IB, and the distance between the 
human hand and the robot end effector, in Figure IC). It is pos- 
sible to observe that if non-salient stimuli are presented to the 
behavior, the attentional process monitors the environment in a 
relaxed manner, instead, if something salient happens, the clock 
frequency of the behavior is enhanced and more attention is con- 
sequently paid toward the stimulus. This general model permits 
to monitor and control different internal and external processes, 
shifting, from time to time, the allocation of computational 
and operational resources. Notice that, this adaptive frequency 
implicitly provides a mechanism for behaviors prioritization. 
Indeed, high-frequency behaviors are associated with activities 
with a high relevance and priority in the current operational 
context. 

Formalization of the model 

Following the approach of Burattini and Rossi (2008), we con- 
sider a Behavior-based architecture (Brooks, 1986; Arkin, 1998), 
where each behavior is endowed with an attentional mechanism 
represented by an internal adaptive clock. 

A schema theory representation (Arbib, 1998) of an atten- 
tional behavior is illustrated in Figure 2. This is characterized by 
a Perceptual Schema (PS), which elaborates sensor data, a Motor 
Schema (MS), producing the pattern of motor actions, and an 
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FIGURE 1 I Example of the attentional allocation strategies presented in Sidobre et al. (2012). (A) The sampling rate associated with the behavior of giving 
an object changes depending on internal or external stimuli; (B) the human hand speed and (C) the distance between the human hand and the robot end-effector 
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FIGURE 2 I Schema Theory representation of an attentional behavior 



attentional control mechanism, called Adaptive Innate Releasing 
Mechanism (AIRM), based on a combination of a clock and a 
releaser. The releasing mechanism works as a trigger for the MS 
activation (e.g., the view of a predator releases the escape behav- 
ior), while the clock regulates the sensors sampling rate and, 
consequently, the activation rate of the behaviors. The clock acti- 
vation rate changes following an attentional monitoring strategy, 
which can adaptively increase or decrease the clock frequency, 
according to salient internal and external stimuli. More formally, 
the attentional mechanism is characterized by: 



mtn^ 



rmaxi 



An activation period ranging in an interval [/i* 
where h is the behavior identifier. 
A monitoring function f (cr^ (t) , p'^ _ ^) : K" ^ R that adjusts 
the current clock period pj", according to the internal state of 
the behavior and to the environmental changes. 
A trigger function p(f, pf), assuming a 0/1 value, which 
enables/disables the data flow cr ''(f) from sensors to PS at each 
time unit. 

Finally, a normalization function (t>(f(cr^(t),p^j_ j)) : R ^ N 
that maps the values returned by / into the allowed range 



The clock period at time t is regulated as follows: 

p\ = p (t, p\_ i) . 0 (/(a^f), pt i) + (l - Pit, pt i)) • pt 1 

(1) 

That is, if the behavior is disabled, the clock period remains 
unchanged, i.e., p\_ j. Otherwise, when the trigger function is 1, 
the behavior is activated and the clock period changes according 
to the <t>(f). In order to learn attentional monitoring strate- 
gies, various methods such as Differential Evolution (Burattini 
et al, 2010) and RL techniques (Di Nocera et al, 2012) have 
been deployed, respectively for off-line and on-line tuning of the 
parameters regulating the attentional monitoring functions. In 
the following sections, we will present an intrinsically motivated 
RL (IMRL) approach to the attentional allocation problem in our 
frequency-based model of attention. 

INTRINSIC MOTIVATIONS: CURIOSITY MODEL 

Curiosity is an appetitive state involving the recognition, pursuit, 
and intense desire to investigate novel information and experi- 
ences that demand one's attention. In literature, we find two main 
theoretical accounts of curiosity: the optimal arousal model and 
curiosity-drive theory. The curiosity-drive model assumes that the 
main drive of curiosity is the reduction of uncertainty: novel and 
ambiguous stimuli cause a need for coherence restore that reduces 
the uncertainty. This reduction is considered as rewarding. This 
model is supported by studies showing that unusual situations are 
associated with approaching behaviors and attentional states (e.g., 
see the Loewenstein, 1994 knowledge gap/approach gradient). 
However, the curiosity-driven model cannot explain why biolog- 
ical organisms initiate exploratory behaviors without any stimuli. 
These situations are instead well explained by the optimal-arousal 
model (e.g., see the Spielberger and Starr, 1994 model). Following 
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this model, the biological systems are associated with an home- 
ostatic regulation of their arousal level: when the arousal level 
is under-stimulated, the organism is motivated to increase the 
arousal and to look for novel situations; in contrast, when the 
organisms is over-stimulated additional stimuli are evaluated as 
negative and associated with an avoidance behavior. While in 
the curiosity-drive model the reward is associated with uncer- 
tainty reduction, in the optimal arousal model, the induction of 
curiosity is directly rewarding. Also this model is not completely 
satisfactory, indeed, in this case an optimal arousal state should 
be maintained and the reward is directly associated with a feel- 
ing of interest, hence the gain of new knowledge could reduce this 
feeling and could be considered counter-productive (see Litman, 
2005 for a discussion). A combination of these two approaches is 
proposed by Litman (2005) with the interest/ deprivation model of 
curiosity. Here, both the satiation and the activation of curiosity 
can be rewarding: the interest-based curiosity is driven by novel 
stimuli and opportunity of learning, whereas the deprivation- 
based curiosity is driven by the uncertainty and the lack of 
knowledge. In Litman (2005) the interest/deprivation model of 
curiosity is then related to the neuroscience of the wanting and 
liking systems, which are hypothesized to underlie motivation 
and affective experience for a wide class of appetites (Berridge, 
2003). In the Litman model, wanting is associated with depriva- 
tion and need of knowledge, while liking is associated with the 
expected pleasure due to learning and knowledge acquisition. In 
Table 1 we show the Litman's classification. Here, in the case of 
high level of wanting and liking, curiosity is due to a need of 
knowledge and it is sustained by an interest; if wanting is low, but 
liking is high, information seeking is motivated by pure interest; 
in contrast, if wanting is high and liking is low, the need of knowl- 
edge is not associated with the anticipation of a pleasure. Finally, 
when wanting and liking are low, also the curiosity drive is inhib- 
ited. In this paper, we exploit a model of curiosity that is inspired 
by the Litman (2005) interpretation of wanting and liking. 

REINFORCEMENT LEARNING FOR AHENTIONAL SHIFTING 

Following the approach by Di Nocera et al. (2012), in this paper 
we exploit a RL algorithm to learn the attention allocation strate- 
gies introduced in section 2.1. In Di Nocera et al. (2012), a 
Q-learning algorithm is used to tune and adapt the frequencies 
of sensors sampling, whUe action selection is obtained as a side 
effect of this attentional regulation. In the following, we first recall 
the Q-learning algorithm and then we detail its application to the 
attentional shifting problem. 



Table 1 | Litman's classification of curiosity states with respect to 
high and low levels of liking and wanting (Litman, 2005). 

Liking Wanting 

Low High 

Low LL: Ambivalent disinterest LH: Need for uncertainty 

clarification 

High HL: Curiosity as a feeling HH: Curiosity as a feeling 

of "interest" of "deprivation" 



General description of the Q-learning algorithm 

Q-learning (QL) (Watkins and Dayan, 1992) is a learning algo- 
rithm for a Markov Decison Process (MDP). A MDP is defined 
by a tuple (S, A, R, Pa) where S is the set of states, A is the set of 
actions, R is the reward function R: S x A M., with R(s, a) the 
immediate reward in s e S after the execution of a € A; Pa is the 
transition function : S x A x S ^ [0, 1] € K, with Pa(s, a, s') 
probability of s' e S after the execution of a e A in s e S. A solu- 
tion of a MDP is a policy n : S ^ A that maps states into actions. 
The value function (s) is the cumulative expected reward from 
s e S following ji . The q-value Q(s, a) is the expected discounted 
sum of future payoffs obtained by executing the action a from 
the state s and following an optimal policy n*, i.e., Q(s, a) = 
{Rt+i + YV*(st+i) \ St = s, at = a}, with V* associated to ji*. 

In QL techniques, the Q-values are estimated through the 
agent experience after being initialized to arbitrary numbers. For 
each execution of an action at leading from the state Sf to the state 
S(_|- 1, the agent receives a reward rt-^-i, and the Q-value is updated 
as follows: 

Q(st, at) ^ (1 - at) ■ Q{st, at) + at{Rt+ i 

+ y ■ maXa,_,_^eAQ(St+l,at+l)) (2) 

where y is the discount factor (which determines the importance 
of future rewards) and a is the learning rate. 

Different exploration policies can be introduced to select the 
action to be executed trying to balance exploration and exploita- 
tion. Analogously to Di Nocera et al. (2012), in this paper we 
consider a Softmax method that selects the action to be executed 
through a Boltzmann distribution (Sutton and Barto, 1998): 

exp r 

Pa(a \s,Q) = ^ (3) 

E exp r 

bi;A(s) 

Here, the temperature r controls the exploration strategy: the 
higher the temperature, the closer the strategy is to a random 
policy (exploration); the lower the temperature, the closer the 
strategy is to Q(s, a) maximization (exploitation). Under suit- 
able conditions (see, for example, Watkins and Dayan, 1992), this 
algorithm converges to the correct Q-values with probability 1 
assuming that every action is executed in every state infinitely 
many times and a is decayed appropriately. 

Q-learning for attentional regulation 

In our setting, the QL algorithm is to be exploited to generate 
the attention allocation strategy. For this purpose, we introduce 
a suitable space state S'' for each attentional behavior, while the 
action space A'' represents a set of possible regulations of the 
clocks. Specifically, the action space spans a discretized set of pos- 
sible allowed periods P*" = {pp . . . , p^} for each behavior b (i.e., 
a'' coincides with P''). Since the current state s*" e S'^ should track 
both the attentional state (clock period) and the perceptive state, 
this can be represented by a pair = (p^, cr^), where p'' G P^ is 
the current clock period and cr'' e X'' is for the current percep- 
tive status. In particular, we consider the perceptive state of each 
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behavior as a discretization of the behavior perceptive domain 
using n equidimensional intervals X'' = {cr{', . . . , a^}. Therefore, 
the attentional allocation policy it^ : ^ A'' represents a map- 
ping between the current state and the next attentional period 

that should be learned by means of the QL algorithm. That 
is, given a reward function _R for each behavior, the algorithm is 
to find the optimal attention allocation policy for each 

state 5'' G S^, the activation period p'' € that maximizes the 
expected reward of that behavior. 

The resulting Q-table for a generic attentional behavior in Di 
Nocera et al. (2012) can be described by the Table 2. 

This approach to adaptive attentional allocation and action 
selection has been tested in a robotic setting (Di Nocera et al., 
2012). Starting from this model we will design our model of 
Intrinsically Motivated Reinforcement Learning. 

MOTIVATED RL FOR AHENTIONAL SHIFTING 

In this section, we extend the RL approach to attention allocation 
presented above introducing the effects of the intrinsic motiva- 
tion of curiosity. In particular, we rely on a curiosity model which 
is inspired by the interest/ deprivation model proposed by Litman 
(2005) and adapted to the behavior-based setting we consider in 
this work. More specifically, analogously to Litman, we associ- 
ated the liking mechanism to a direct reward related to novelty, 
however, our interpretation of the wanting system is slightly dif- 
ferent. Indeed, in the place of the cognitive deprivation model 
introduced by Litman, which cannot be easily accounted within 
the simple behaviors we are concerned with, we relate the want- 
ing mechanism to the need to explore and act. This is represented 
by a value that we called the residual energy: the higher the avail- 
able energy, the higher is the need to "consume" this surplus in an 
exploratory (hence, curious) behavior. More details will be pro- 
vided below and in section 4 where we present some concrete 
instances of the reward functions used to capture this model of 
curiosity. 

ACTION SPACE 

Analogously to Di Nocera et al. (2012), in our model, for each 
behavior h we introduce an Action Space representing the set 
of possible periods P'' = {pj , . . . , p^} for that behavior. That is, 
an action a'' is a possible assignment of a clock period which 
regulates the sampling rate and the activation frequency of the 



Table 2 | Q-values for a generic behavior, wiiere ^ represents tPie 
state space. 







Pi 


P2 


Pk 




Pi 


Q11.1 


Q1I.2 






Pk 


Qu,i 


Qu.2 


C^Tk.k 




Pi 


Onl.l 


Qn^.2 


OnT.k 




Pk 




Onk.2 


Onk.k 



associated behavior. As explained above, the idea is that the sys- 
tem does not learn directly the action to execute, instead, it learns 
the attentional policies (i.e., clock regulations with respect to its 
perceptual and attentional state). In this context, the action selec- 
tion is an indirect consequence of the attentional behaviors. In 
the curiosity- driven setting, different attentional shifting strate- 
gies will be learned depending on the level of curiosity of the 
agent. 

STATE SPACE 

In order to represent the curiosity state into the state space, we 
reformulate the State Space S'' of Di Nocera et al. (2012) intro- 
ducing a new parameter representing the degree of curiosity of 
the agent. In the extended framework, the state 5*" is determined 
by a triple (c*, p'', cr''), where, c^ represents the level of curiosity of 
the system, p*" is for the current clock period, and cr is the current 
perceptive state of a behavior b. In particular, for each behavior, 
the attentional monitoring period p'' ranges in a predefined set 
of possible values P^. Analogously, the perceptive state cr'' is suit- 
ably discretized in intervals representing sub-ranges of the input 
signal X^. Finally, the curiosity degree c^ ranges in an interval 
of the four values [LL, LH, HL, HH] representing four relations 
between wanting and liking values (low-low, low-high, high-low, 
high-high) which are inspired by the curiosity model definition 
introduced in Litman (2005) (see section 2.2). Therefore, the 
attentional allocation policy jr'' : S'' ^ A*" represents a mapping 
between the current state s'' and the next action a'' correspond- 
ing to the suitable period for the attentional monitoring p'', that 
is learned by means of the QL algorithm. 

REWARD FUNCTION 

Given the Q-Learning Actions and States Spaces, we can introduce 
the Reward function as a combination of extrinsic component, an 
intrinsic component and a dynamic weight between these two. 
While the extrinsic reward depends on the direct effect of the 
actions with respect to the behavior utility, in our curiosity model, 
the second reward is directly related to the pleasure of the nov- 
elty, hence to the level of liking. Instead, the wanting level is used 
to dynamically balance the relation between extrinsic and liking 
reward: the higher the need of information seeking, the higher the 
liking associated with the encountered novelty. As stated before, 
differently from Litman (2005), our assumption is that the level 
of wanting depends on a sort of (global) energy state of the agent 
(see section 4 for additional details in the case study). The idea 
is that the robotic agent can explore new situations, guided by 
curiosity, only when the system is in a wellness state. Instead, 
when the system is under a certain wellness threshold, the atten- 
tion is focused on priority needs (e.g., to eat and drink) rather 
than on secondary ones (information seeking and exploration of 
new states). We formalize the overall reward function as follows: 

= {l-w).Rl + {w) ■ R\ (4) 

where -R^ is the reward computed considering the observed state, 
and _R|' represents the reward evaluated considering the satisfac- 
tion of an observation with respect to a particular curiosity state 
(i.e., the reward is related to something that the agent likes just 
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because it is novel). The value of is thus computed as level 
of liking. The w value represents the level of wanting, an inter- 
nal unmotivated need to explore something (the drive toward a 
specific location/object depends on the liking mechanism). 

This relation between liking and extrinsic rewards implies that, 
when the situation is critical (i.e., low energy) the _R; reward value 
will be neglected with respect to the _Re extrinsic reward value, 
while Ri will be emphasized as much as the agent will be in 
a wellness state. The possible correspondences between the _R;, 
Re rewards and the wanting, liking values are illustrated in the 
Table 3. Notice that this matrix of wanting and liking relations is 
different from the one by Litman (2005), because of the different 
interpretation of the wanting system. For example, here low want- 
ing and liking levels are associated with the prevalence of extrinsic 
rewards, while in Litman (2005) they are directly associated to a 
boredom state. 

CASE STUDY 

In order to test our approach we introduce a Survival Domain, 
where a robot must survive for a predefined amount of time 
within an environment (see Figure 3) avoiding obstacles (objects, 
walls, etc.) and recharging energy by eating and drinking. 

We consider simulated environments of 16 m^. Obstacles, 
water, and food locations are cubes of size 0.5 m x 0.5 m x 0.5 m, 
respectively of black, blue, and green color (see Figure 3). An 
experiment ends in a positive way if the robot is able to survive 
till the end of the test (max_time), while it fails in the following 
three cases: (1) the robot collides with an obstacle or its dis- 
tance from an obstacle is under a certain safety distance threshold; 
(2) the value representing the robot thirst goes under the min- 
imum value; (3) the value representing the hunger goes under 
the minimum value. We tested our approach using a simulated 
Pioneer3-DX mobile robot (using the Player/Stage tool Gerkey 
et al, 2003), endowed with a blob camera and 16 sonar sensors. 

Internal needs functions 

We assume that the robot is endowed with internal drives. In our 
case study, we consider two internal needs: hunger and thirst. 
These are modeled by the following functions. 

We introduce a Hunger function, to compute the need for food: 

Hunger(t) = Hunger{t — 1) -|- fc • {nb_act) 

— (cf ■ food_consumed) (5) 

Here, the hunger increases the need for food at each machine cycle 
by a fc value, for each active behavior {nb_act), and decreases it 



Table 3 | Wanting and liking relations and the associations between 
liking and extrinsic rewards. 

Liking Wanting 

Low High 

Low LL: Re >> /?/ LH: Re < Ri 

High HL: Re > Ri HH: Re << R/ 



when a quantity of food is ingested {food_consumed), depending 
on the energy power of the food (ef). 

An analogous Thirst function is used to compute the need for 
water: 

Thirst{t) = Thirst(t - 1) -|- fc • (nb_act) 

— {e„ ■ water _consumed) (6) 

AnENTIONAL BEHAVIOR-BASED ARCHITECTURE 

We introduce a Behavior-Based Attentional Architecture (see 
Figure 4) where, the attentional control is obtained from the 
interaction of a set of three parallel attentional behaviors AVOID, 
EAT and DRINK. 

For each behavior, the process of changing the rate of sen- 
sory readings is interpreted as an increase or decrease of selective 
attention toward internal or external saliences. These sources of 
salience are generally behavior- and task-dependent; these can 
depend on either internal states, such as hunger, thirst, etc., or 
external stimuli, such as obstacles, unexpected variations of the 
environment, attractiveness of a particular object, etc. The over- 
all attentional behavior should emerge from the interrelation of 
the attentional mechanisms associated with the different primi- 
tive behaviors and learned by means of the motivated RL learning 
technique. 

In Figure 4 we illustrate the attentional control system 
designed for the survival domain. It combines three behaviors: 
AVOID, EAT, and DRINK, each endowed with its releaser and 
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FIGURE 3 I The testing environment is simulated through the 
Player/Stage tool for robotics development (Gerkey et al., 2003). We 

adopt a simulated Pioneer3-DX mobile robot endowed with a blob camera 
and 16 sonar sensors. The black, blue, and green colored cubes (of size 
0.5 m X 0.5 m x 0.5 m) within the environment represent respectively 
obstacles, water, and food. 
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FIGURE 4 I Attentional behavior-based architecture with intrinsic motivations. 



adaptive clock. The output of the robotic system is the combi- 
nation of the outputs (if they are available). 

AVOID 

Manages obstacle avoidance. Its input signal the mini- 

mum distance of the 8 frontal sonar sensors; its motor schema 
controls the robot linear and angular velocity (v(f), 0{t)) gener- 
ating a movement away from the obstacle. The obstacle avoidance 
is obtained as follows: v(t) is proportional to the obstacle proxim- 

ity, i.e., v{t) = v„ax ■ , where Vmax and cr^™ > ^''^ respectively 
the maximum velocity and the maximum sonar range; 0{t) is 
obtained as weighted sum of the angular velocities generated by 
the active sonars, i.e., 6(t) = J2iiEA(t) ^max ■ where A{t) is the 
set of active sonars detecting an obstacle at time f, S^ax is the 
maximal rotation, 6*; is a suitable weight depending on the sonar 
position (high values for frontal sonars and low for lateral ones). 

EAT 

Monitors an internal function Hunger(t) representing the need 
of food. At each execution cycle the Hunger function changes as 
described in the previous section. Therefore, EAT is active when 
^eat _ i{nngQY{t) goes above a suitable threshold a^^. When 
enabled, if a green blob (representing the food source) is detected 
by the camera, the motor schema generates a movement toward 
it, otherwise it starts looking around for the green, generating a 
random direction. 

DRINK 

Monitors a function Thirst(t) that represents the need of water 
and considers the height (pixels in the field of view) of a detected 
blue object in the environment as an indirect measure of the 
distance from the object. The motor schema is enabled whenever 
the af""'' = Thirst(t) is greater then a suitable threshold ff*^''. 
When enabled, if a blue blob is detected by the camera, the 



motor schema generates a movement toward it, otherwise it starts 
looking around as for the EAT behavior. 

For each behavior, the clock regulation depends on the moni- 
toring function that should be learned at run-time. 

MOTIVATED AHENTIONAL FRAMEWORK 
Action and state spaces 

In order to cast the RL problem in our case study, we have to 
define A''s and S^s. In our attentional allocation problem, for each 
behavior, the action space A'' is represented by a set of possible 
periods {pj, . . . , p|!} for the adaptive clock of each behavior b. In 
the case study, for each behavior (AVOID, EAT and DRINK) we 
assume 1 machine cycle as the minimum clock period and the 
following set of possible periods: p"^, p'^ = {1,4,8, 12}. As for 
the state space S^, we recall that each state is a triple {a^ , c^) 
composed of a value in the perceptual domain, a period, and a 
curiosity value. The perceptive state of each behavior is obtained 
as a discretization in six equidimensional intervals of the per- 
ceptive domain . The perceptive domain for AVOID 
spans the interval [0, o'^a'^"'], where o"^™'' is maximum sonar 
range for the behavior; the domain of DRINK is [0, cr^™*^], where 
^ma"^ represents the maximum value for the Thirst function; the 
EAT domain is in [0, Omaxi' where cr^a^ is the maximum state 
of hunger the robotic system can assume. The curiosity value 
ranges in the conceptual interval [LL, LH, HL, HH], where the 
combination of the wanting and liking parameters is considered. 

Rewards 

We assume the reward always positive except for a strong penalty 
if the system cannot survive. For the other cases the reward is 
computed as follows. For each behavior, the extrinsic reward 
has two additive components. The first evaluates the impact of 
frequent activations of a specific behavior. The higher is the fre- 
quency, the smaller is the obtained reward. This component is 
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equal to zero if = P ■ ■ The second component depends on 
the specific behavior. 

In particular, concerning AVOID, each activation is rewarded 
directly proportional to the distance from the obstacle. 



if ! crash 



penalty, 



otherwise 

(7) 

As for EAT behavior, for each activation the reward is inversely 
proportional to the current hunger value. That is, a system that is 
more hungry takes a smaller reward. 



K it) 



[( fr_S ) + (l-^)1' if-'^rash 



penalty. 



otherwise 



(8) 



Analogously, each activation of DRINK is rewarded in inverse 
proportion to the current value of thirst: 



Rf'"''(t) 



_ ^drink 
_£min 



^drink ^drink 

Fmax Ptnin 



+ 1 



drink \ 
drink J 



if ! crash 



penalty. 



otherwise 
(9) 

Following the description of section 3, we subdivide curiosity into 
two components dealing respectively with the feeling of wanting 
and liking. We associate the first one to the concept of residual 
energy for the robot body, while the second one to the level of 
novelty in the exploration of the learning states. 

In particular, we assume that the Energy of the system is 
defined as follows: 

E{t) = E{t — 1) — e„ — e„{, • (nh_act) + ef ■ (food_consumed) 
+ e-^ ■ (water_consumed) (10) 

where the current value of the energy E{t) is computed start- 
ing from the previous level of energy E(t — 1), decremented of 
one unit of energy e„, which represents the energy consumed at 
each machine cycle. Then, we also consider the energy spent to 
activate each behavior e„y, where nb_act is the number of cur- 
rently active behaviors. On the other hand, we assume increments 
of the energy in correspondence of consummatory behaviors 
such as EAT or DRINK, where the added quantity ef {e„) of 
energy depends on the consumed food or water (this is added 
when boolean conditions related to food_consumed consumed 
and water _consumed consumed becomes true). 

According to the model of curiosity considered in this paper, 
we model the level of the wanting component of the curiosity as 
the residue of the Energy value (see Figure 5) ranging within the 
interval [0,1]. 



E{t)-E_well 

- E_well ' 

0, 



if £(f) > E_well 
otherwise 



(11) 



That is, the robot can show a curious behavior only when the 
situation is not critic (i.e., only when the global energy exceeds 
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FIGURE 5 I £(t) is the current Energy level; E_min: is the minimum 
amount of Energy permitting the system to work; E_well: is the level 
of Energy corresponding to a wellness state of the system. 



the Ejwell threshold, indicating a sort of wellness state of the sys- 
tem) . Ejwell is supposed to be associated with a state of the system 
where the regulation of the different behaviors activation periods 
is well balanced and leads to a suitable scheduling of the actions 
(reach food and water when necessary while avoiding obstacles). 
We can interpret this residual value E^ as the Energy that the sys- 
tem can spend on activities which are not associated with primary 
needs. In this way, the higher the Ec, the more the curiosity can 
drive the system to explore new states, the less the attention is 
posed on the primary behaviors (such as EAT, DRINK or AVOID). 
According to Equations (11, 4), c'' ranges only within an interval 
of three values [LL - HI, LH, HH]. LL and HI (i.e., both with 
low wanting) are considered as equivalent and correspond to the 
case of w equal to 0 (e.g., no curiosity). 

The second component of the curiosity is the liking, which we 
associate with the pleasure due to novel situations. In particular, 
the curiosity in our system is interpreted as the exploration within 
the learning states space. We can assume that the novelty of a state 
is computed as follows: 



NVjot 



(12) 



where, NV is for number of visits and '^^"^J represents the num- 
ber of times the percept aj' has been observed during the previous 
NV_tot observations. We, thus, maintain a sort of temporal win- 
dow of value NV_tot. In this way, on the one hand, we capture the 
novelty of the observation; on the other hand, we simulate a sort 
of lapsing mechanism where the novelty of a state is reduced when 
it is frequently visited within the time window. The model of the 
temporal window can be compared to the Itti's model of surprise 
(Baldi and Itti, 2010), by interpreting the temporal window as 
a rough approximation of a statistic on the perceptual history. 
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That is, if the system is not observing a percept for NV_tot times, 
the stimulus becomes likable again. While, if the system observes 
that particular perceptual state cr/' many times (i.e., NV_tot), the 
stimulus associated becomes boring. The Ri values range in the 
interval [0,1], so values grater than 0.5 indicates states with high 
liking. 

The combination of wanting and liking drives model the 
curiosity which will affect the learning system explorative 
attitude. 

Parameters and settings 

In Table 4 we summarize the parameters and the settings used for 
our experiments. 

Here, the perceptual domain (Perceptions) and the possible 
periods (PeriodsActions) are analogous to the ones presented in 
Di Nocera et al. (2012). Indeed, this partition for the perceptive 
domain and the periods have been selected to obtain a satisfactory 
setting for the non-curious system. As for curiosity, it is associ- 
ated with the residual Energy with respect to a threshold set as 
the 2/3 of the maximum energy E_max. The maximum energy 
value E_max is set with respect to the max_cycles that estimates 
the maximum clock cycles associated with an episode (180s to 
accomplish the survival task). This regulation is a compromise 
between scarce energy (that would keep the system in the non- 
curious state) and abundant energy (that would keep the system 
in the curious state). The minimal energy E_min is set to 300. 
Here, for each behavior activation we have an energy consump- 
tion of 1 UoE while the recharge is 150 UoE for food and water. 
Concerning the liking, we employed a temporal windows of 10 
observations to assess the novelty of a perceptive data. As far as 
the learning parameters are concerned, we set a = 0.8 for the 
learning rate, y = 0.9 for the discount factor and r = 1 for the 
temperature. These regulations have been defined after a prelim- 
inary phase of experimental testing in the non-curious setting 
(analogous to the one presented in Di Nocera et al., 2012). 

RESULTS 

In this section, we present the experimental results of a robot 
that must survive for a predefined amount of time within an 
environment (see Figures) avoiding obstacles (objects, walls, 
etc.) and meeting its energy needs by eating and drinking. We 



discuss the approach by considering the performance of the 
intrinsically motivated RL in learning attentional allocation poli- 
cies in this survival domain. In particular, our aim is to evaluate 
the effects of the curiosity on the RL process by comparing 
the behavior of the curious system (from now on called CR = 
CuriousRobot) with respect the one of system that is not endowed 
with the curiosity drive (from now on called NCR = Non — 
CuriousRobot). Namely, the difference between the CR and NCR 
models is that the latter does not consider the rewards due to the 
curiosity. Notice that the parameter regulation process described 
in the previous section was carried out in order to obtain the 
best regulation for the non-curious system. Since these settings 
are shared by the curious and non-curious system, we can assess 
the added value of the intrinsically motivated framework in the 
testing scenario. 

In order to evaluate how the curiosity affects the learning pro- 
cess, we first compare the survival time percentage of the CR 
with respect to the NCR. In Figure 6 we plot the survival time 
percentage averaged every 50 episodes. 
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FIGURE 6 1 Comparison between CR and NCR systems with respect to 


tile survival time percentage per Episode. The survival time is averaged 


every 50 


episodes. 







Table 4 | Table of the parameters experimental setting {UoE, Unit of Energy; UoR, Units of Reward; mc, machine cycles; m, meters; s, seconds; 
obs, observations). 

Experimental settings 



.y avoid 
^ max 
^ avoid 

min 

eat 
-'max 

eat 

min 

drink 
-'max 

drink 

min 



p , p . p 



Perceptions 

1.0 m 
0.4 m 
1500 UoE 
300 UoE 
1500 UoE 
300 UoE 
PeriodsActions 

{1 mc, 4 mc, 8 mc, 12 mc} 



Curiosity 

E_max 6000 UoE (4''max_cycles) 

E_min 300 UoE 

E_well (2/3)*E_max 
NV_tot 10 obs 

Learning parameters 
a 0.8 
y 0.9 

T 1 



Episode 

max_time 180 s 

max_cycles IBOOmc 

penalty -1500 UoR 
Power of Food/Water 

Bf 150 UoE 

Ow 150 UoE 

Ou 1 UoE 

Onb 1 UoE 
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As we stated before, the robot must survive in the testing envi- 
ronment for a predefined amount of time ( max_time) . The plot in 
Figure 6 shows that during the first 250 episodes the NCR system 
is more effective in surviving in the environment. In fact, the sur- 
vival time percentage starts from a value over 90%. That is, the 
NCR system is more effective in action selection than the CR sys- 
tem. This could be due to the fact that the curiosity, initially, leads 
the system to prefer the exploration of novel spaces rather than 
the goal-directed ones. However, after a while, the CR system 
starts to rapidly increase its survival time until it over pass the 
NCR system and reaches the convergence (100% of the survival 
time) around the episode 300, with respect to the NCR system 
that does not reach the convergence before 450 episodes. Hence, 
in both the cases we observe that the learning converges after at 
most 450 episodes, however, in the case of the robot endowed with 
curiosity an earlier convergence is obtained. 

In Figure 7, we show the cumulative rewards for each behavior 
during the learning process. The red lines describe the trend of 
the rewards gained by the system endowed with curiosity, while 
the blue ones are for the NCR system. As expected, during the 
first episodes the curious robot is not able to learn the atten- 
tional strategies needed to regulate the activations of the robot 
behaviors. The cumulative rewards related to the behavior AVOID 
of the CR system show that the performance remains unsatis- 
factory approximately until the episode 200. Then, the values of 
the cumulative rewards starts to increase and to converge from, 
approximately, the episode 300. In contrast, the NCR system 
shows a worst trend of the cumulative rewards for the EAT and 
DRINK behaviors. This could be explained by the fact that the 
robot is not guided by the curiosity to immediately explore the 
spaces of the environment where food or water are not observed. 
It only learns to eat or drink when the associated need functions 
(hunger and thirst) exceed a certain threshold; while the associ- 
ated behaviors remains always relaxed. That is, the learned policy 



for the NRC DRINK behavior always selects the maximum value 
for the period (p*'"'^ = 12) for all the states associated to low 
levels of thirst (i.e., from a\ to CT4). On the contrary, it selects 
the shorter period value (p*™*^ = 1) for the states with a high 
level of thirst (as and Of,). In the case of the CR robot, the pro- 
cess of learning is affected by the curiosity, which influences the 
robot behavior to explore spaces of the environment with food 
or water sources since it is immediately attracted by novel stimuli 
(including green and blue blob). Hence, the CR system learns to 
eat or drink also when this is not strictly required. For example, 
the learned policy for the CR DRINK behavior, in the case of low 
curiosity, associates the maximum value for p'^™^^ only to fewer 
states with low levels of thirst (from ai to a^), and it selects short 
period values [p'^™^ = 1) for all the other states (from 04 to erg). 
Finally, the learned policy for the CR DRINK behavior, in the 
case of high curiosity, always associates p*'"*^ = 1 or p'^""'^ = 2 
to all the levels of thirst (from cti io a^). At the end of the exper- 
iments NCR EAT and DRINK rewards converge to higher values, 
however, the global reward is higher for the CR. The global cumu- 
lative rewards are collected in the last plot of Figure 7 (on the 
bottom row), which shows the faster convergence of the CR sys- 
tem with respect to the NCR one. Finally, the CR learned policies 
for EAT and DRINK can always maintain the Energy value above 
the wellness threshold (this is also visible in Figure 10). 

In Figure 8A, we show, respectively, (A) the trends of the need 
functions within a single episode after the convergence of the 
learning process, and (B) the trend of the average value of the 
maximum value of the need functions among all the episodes. If 
we look at their trends, in Figure 8A we observe some periodi- 
cal path for the values of each function. We interpret the plots, 
in the case of the CR, as an effective learned attentional shifting 
policy of the behaviors EAT and DRINK. The robot seems to find 
a rhythmic alternation of its needs of eating and drinking (the 
decreasing part of the hunger and thirst functions corresponds 
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FIGURE 7 I Comparison between CR and NCR rewarding values during learning process evaluated for each episode (A) and considering the data 
averaged every 100 episodes (B). 



Frontiers in Psychology | Cognitive Science 



April 2014 I Volume 5 | Article 273 | 10 



Di Nocera et al. 



Intrinsic IVIotivations in Attention Allocation and Shifting 




Trends of the need funciions within one single episode. 



Average value cwnputed on i o ^isodes 












V 




1 1 1 





400 450 



Trend.s of the need funciions among all the episodes. 



FIGURE 8 I CR and NCR need functions comparison (A) within one run (single episode) and (B) average maximum value of the Hunger/Thirst 
function considering the data mediated every 10 episodes. 
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FIGURE 9 1 Safety function comparison. 





to the consuming of food or water, respectively). On the con- 
trary, the NCR system just waits to become very hungry or very 
thirsty before starting to search for sources of food and water. The 
behavior of the NCR robot, while on the one hand, driven by 
the thirst and the hunger need functions, achieves better results 
in terms of the single rewards, on the other hand, this does not 
lead to a global better reward for the NCR with respect to the 
CR (see Figure 7). This is also visible in Figure 8B where we can 
observe that the need to eat or drink for the NCR is, on aver- 
age, always greater than the CR needs. Thus, the CR system is 
able to find a configuration of the activation periods (i.e., suit- 
able attentional monitoring strategies), associated with the EAT 
and DRINK behaviors, such that the robot never suffers because 
of some internal need, leading to a best homeostatic regulation of 
the internal variables. 

In order to evaluate the performance of the two robots we 
defines a measure of safety as follows: 



Safetyit) 



— avoid f.avoid 
Fmax 



-pr 



X avoid 



rmax 



^avoid 



(13) 



where the level of safety is calculated with respect to the mini- 
mum distance between the current position of the robot and an 
obstacle. Here, the danger increases when the distance decreases 
and the AVOID activation period is relaxed; and, viceversa, the 
safety increases when the activation period of the AVOID is suit- 
ably balanced with respect to the distance from an obstacle. The 
improved performance of the CR system is visible in the evalua- 
tion of the safety function (see Figure 9), where we observe more 
pleasurable values for the CR robot, and of the energy function 
(see Figure 10), where the CR system is able to maintain the levels 
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FIGURE 10 I Energy function comparison. 



of energy EcWithCuriosity not only above the threshold of well- 
ness E_well, but also stabilized at a high value. The eighth row of 
Table 5 shows the average rewards of AVOID, EAT, and DRINK 
behaviors and the averages values of the global reward for the 100 
episodes of validation. 

All the results of the above plots are summarized in Table 5, 
where we evaluate the average values and the standard deviations 
on 100 episodes used to validate our system (after the convergence 
of the learning process). Regarding the global energy of the sys- 
tem, we already noticed that such values stabilized on a specific 
interval for the CR (see Figure 10). In Table 5, we can find that 
the average of the Energy mean values (= 3720) is a bit smaller 
than the NCR case (= 3809) and that its maximum value (= 4488, 
which is above the wellness threshold E_well ) is smaller than the 
NCR energy maximum value (= 4525). However, we suppose that 
the CR average of the energy mean values is smaller because of 
the curiosity (residual energy), which is "consumed" for explor- 
ing new states during the learning process. Interestingly, such an 
exploration of new states does not imply that the robot is less cau- 
tious in moving around. Indeed, the safety average value of the 
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Table 5 | Maximum, minimum and average values for the need functions (safety, hunger and thirst) and the energy function. 

Robot without Curiosity Robot with Curiosity 

Max Min Max Min 



Energy 
Safety 

Average value 

Reward 
Activation 



4525 ± 247 
0.97 ± 0.09 
Energy Safety 

3809 ± 335 0.38 ± 0.02 
Avoid Eat 

66 ± 2 83 ± 6 

109 ±3 105 ±7 



2683 ± 212 
0.12 ± 0.02 
Hunger Thirst 

420 ± 181 363 ±135 
Drink Global 

85 ± 5 235 ± 7 

103 ±2 260 ± 22 



4484 ± 344 
1.00 

Energy Safety 

3720 ±312 0.61 ±0.03 
Avoid Eat 

100 ± 9 70 ± 7 

204 ±19 130 ±10 



2781 ± 115 
0.18 ±0.04 
Hunger Thirst 

328 ± 142 311 ± 137 
Drink Global 

78 ± 11 248 ±15 

156 ±10 418 ±22 



Average values of the cumulative eat and drink rewards and of number of activations of tfie befiavlors after tiie learning process. 



CR is almost two times greater (= 0.61) than the NCR (= 0.38). 
Moreover, the minimum value for the CR safety (= 0. 18) is higher 
(so, more safer) than the NCR value (= 0.12). Finally, as noted 
in the plots (see Figure 8A) the need functions of hunger and 
thirst have smaller average values for the CR (hunger = 328 and 
thirst = 311), which means that the robot satisfies its needs more 
frequently. 

Both CR and NCR are effective in spending computational 
resources. This can be observed by considering the last row of 
Table 5, where we show the average number of the behavior acti- 
vations. The CR has a slightly greater number of activations, in 
particular for the AVOID behavior. This leads to an emergent 
behavior consistent with what discussed above. The CR robot 
eats and drinks more frequently and shows a safer behavior with 
respect to the NCR. However, 204 activations out of max_cycles 
(around 1150 in this specific case) possible activations (machine 
cycles of an episode) seems a satisfactory result for a behavior- 
based architecture (i.e., there is a reduction of the 83% of the 
number of activations). 

Moreover, notice that the global value shown at the end of 
this row states for how many cycles at least one behavior was 
active during the episode. In the case of NCR, this value is equal 
to 260. This shows that it is frequent to find more than one 
behavior active at the same time. For the CR robot, this value 
is equal to 418, meaning that for the most of the time only one 
behavior is active and the robot is able to orchestrate the multi- 
ple behaviors by opportunely shifting attentional resources, from 
time to time, toward the most salient one according to its need 
functions. 

Finally, another interesting result regards the curiosity influ- 
ence on the actual environmental exploration space. Indeed, 
while we expected that our intrinsic motivated RL would lead 
the learning process to improve the exploration of the inter- 
nal learning states, we did not expect that this would also 
produce an increased spatial exploration of the environment. 
This result can be illustrated by plotting the paths of the two 
systems during the overall experimentation (see Figure 11 C). 
By comparing the two generated paths, we can note that the 
system endowed with internal motivation (see Figure 11 A) is 
more explorative (the cumulative traces of 500 episodes cov- 
ered the 50% of the total area) with respect to the non-curious 
one (44% of the total area covered as shown in Figure IIB). 
The CR path seems smoother with a better coverage of the 



space around obstacles, food, and water while keeping the robot 
safe. 

DISCUSSION 

In this paper, we presented an intrinsically motivated RL 
approach to attention allocation and shifting in a robotic sys- 
tem. The framework has been demonstrated at work in a survival 
domain. Differently from classical RL models of action selection, 
where actions are chosen according to the operative/perceptive 
contexts, in our case the action selection is mediated by the 
attentional status of the robotic behaviors. In the literature 
we can find intrinsically motivated RL system where simple 
attentional control mechanisms are involved (e.g., eye move- 
ment in the playground domain in Barto et al., 2004); in this 
paper we tackle the attention allocation and shifting problem, 
which is novel in this context. Indeed, in our setting, the learn- 
ing process is to adapt and modulate the attentional strategies 
used to allocate attentional resources of the system. Specifically, 
our attentional mechanism regulates the behavioral activation 
periods, hence the amount of computational and operation 
resources dedicated to monitor and control the associated activ- 
ities. Following this approach, the global behavior of the system 
is not directly generated by an action selection policy (as in typ- 
ical RL approaches to action selection Sutton and Barto, 1998 
and intrinsically motivated RL Barto et al, 2004; Singh et al, 
2004), instead, it emerges as the sum of the outputs of mul- 
tiple parallel processes, each activated with its own frequency: 
the smaller the activation period of a behavior, the higher its 
influence on the global emergent behavior. Following the tax- 
onomy proposed by Baldassarre and Mirolh (2013), our system 
can also be considered as a competence-based system where the 
skill to be learned is the attentional allocation policy, however, 
this policy has only an indirect effect on the overall expected 
reward. 

As the main intrinsic motivation, we considered the curiosity 
drive which is inspired by the one proposed by Litman (2005). 
This model allows us to account for both optimal arousal and 
curiosity-driven approaches to curiosity modeling. In particu- 
lar, we related the liking and wanting drives of the Litman's 
model to, respectively, the pleasure of the novelty and the resid- 
ual energy of the system (the higher the energy value over 
the wellness state, the higher the drive toward to the explo- 
ration of novel situations and states). While several models for 
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Curious Robot coordinates in 2D space during the experiment Q Robot coordinates in 2D space during tile experiment Q Robot coordinates in 2D space during tlie experiment 




Curious Robot cumulative paths. Non-Curious Robot cumulative palbs. Overlap of the two cumulative patlis. 



FIGURE 11 I Space of the environment explored during the Curious (A— red line) or Non-Curious (B— blue line) Learning Systems execution. The third 
graph represents the comparison between the two systems (C). 



novelty-based and knowledge-based (Schmidhuber, 1991; Singh 
et al., 2004) curiosity have been proposed in the intrinsically 
motivated RL literature, the employment of the Litman account 
is less explored. Notice that we do not employ knowledge-based 
curiosity models (Schmidhuber, 1991; Singh et al, 2004). Indeed, 
while in Schmidhuber (1991) and later in Singh et al. (2004) and 
Oudeyer and Kaplan (2007) curiosity should lead the agent to 
explore areas of the environment where the learning progress is 
expected to be high, in our system, the agent is directly attracted 
by novel stimuli as sources of saliency. We want to stress here 
that the attentional problem addressed in our work is different 
from the ones mentioned since we learn attentional allocation 
only. In contrast to Schmidhuber (1991), Singh et al. (2004), 
and Oudeyer and Kaplan (2007), we can only enhance atten- 
tion with respect to the attracting stimuli, but the movement of 
the system toward the stimuli is obtained as an indirect effect. 
As for the novelty, the lapsing mechanism we defined for the 
liking function (the novelty of a state is reduced when it is fre- 
quently visited within the time window) can be related to the 
Itti's model of surprise (Baldi and Itti, 2010), but also to the 
approach proposed by Oudeyer and Kaplan (2007), where, once 
predictions within a given part of the sensorimotor space are 
learned, the system gets bored and starts to execute other actions. 
As far as attentional allocation and shifting is concerned, RL 
models have been mainly proposed for visual attentions and 
gaze control (Bandera et al., 1996), a theoretical link between 
visual attentional exploration and novelty-based intrinsic moti- 
vations is investigated in Schlesinger (2012) where the author 
investigates the way in which goal directed, top-down atten- 
tional skills can be incrementally learned exploiting complex 
novelty detection strategies. Differently from these cases, here we 
investigated an intrinsically motivated RL approach to the gen- 
eration of attentional strategies that are suitable for the executive 
control. 

Our approach has been illustrated and tested in a simulated 
survival domain, where a robot was engaged in survival tasks 
such as finding food or water while avoiding dangerous situa- 
tions. In this context, our goal was to show the feasibility and the 



effectiveness of the approach in a typical robotic domain where 
basic needs satisfaction and intrinsic (curiosity) motivations were 
clearly defined. In particular, we compared the performance of 
the intrinsically motivated RL with respect to the same set- 
ting except for the fact that the influence of the intrinsic drive 
was neglected. The parameter tuning was provided in order to 
find the best regulation of the non-curious setting to assess the 
added value of the curiosity drive. The collected results support- 
sthe hypothesis that the curiosity-driven learning system permits 
to find satisfactoryregulations of the attention allocation and 
shifting policies, providing different attentional policies,and con- 
sequently different emergent behaviors, depending on the current 
level of curiosity. Moreover, the overall behavior that emerges- 
from the execution of the learned attentional policies seems 
safer and capable of keeping therobotic system in a higher well- 
ness state during the environment exploration. This is related to 
thefact that the curiosity drive stimulates the attention toward 
opportunitiesof energy recharging (food and water) more fre- 
quently than in the non-curious system. Wealso observed that 
the curious system provides a more uniform exploration of the 
environment when compared with the non-curious behavior. 
While the presented tests illustrate the feasibility and effective- 
ness of the approach in a typical survival domain, the extension 
of this curiosity-driven attentional regulation method to more 
complex domains and more structured tasks (e.g., considering 
hierarchical skills Barto et al, 2004 and top-down attentional reg- 
ulations Schlesinger, 2012) remains to be investigated as future 
work. 
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